1  Introduction


In many areas of science, problems of both practical and theoretical
interest consist in determining a configuration of a set of parameters that
is “optimal” with respect to some cost function. A variety of minimization
algorithms have therefore been developed for finding the global minimum of
a multivariate function. In this short note we describe an algorithm that we
have successfully used to solve a class of minimization problems that arise
in the study of learning from examples.

The cost functions that we consider, an explicit form of which is presented
in the next section, are defined in high-dimensional spaces and usually show
many local minima. Standard descent techniques, such as gradient descent or
conjugate gradient (Polak, 1971; Acton, 1970; Press et al., 1987), are of
limited usefulness in this case, since they cannot escape local minima.
Moreover, although the analytical expression of the gradient is available,
its computation can be very time consuming, and it would be preferable to
avoid it. It may therefore be better to use nondeterministic techniques of
the Metropolis type (Metropolis et al., 1953), which usually do not require
the computation of derivatives and are better suited to dealing with local
minima. A successful nondeterministic method is simulated annealing
(Kirkpatrick et al., 1983), which, however, requires the difficult choice of
an annealing schedule and has a high computational cost. The algorithm that
we developed is nondeterministic, does not require any annealing, and does
not have a high computational cost. Although we are not guaranteed to find
the global minimum, experimental results show that, for the class of
functions we are considering, good local minima are usually found.

The algorithm we implemented is similar in spirit to many heuristic
minimization algorithms, such as A* (Hart et al., 1968; Shapiro, 1987), the
algorithm of Kernighan and Lin (1970, 1973), and the algorithms described by
Pearl (1984). It is also similar in spirit to some “evolutionary”
optimization algorithms, such as those described by Wang (1987) and Fogel
(1988). It is interesting to note that most of these algorithms were
developed and used to solve combinatorial optimization problems, such as the
Traveling Salesman Problem (Lin and Kernighan, 1973), and have rarely been
used to minimize smooth functions. We tested our algorithm in a variety of
cases, all belonging to the class of minimization problems described in the
next section, and we compared the results with those given by a standard
technique of gradient descent with adaptive step. In all the cases
considered, the nondeterministic algorithm finds the best local minima.
Preliminary experiments suggest that this could hold true not only for the
particular class of functions we are considering, but also for a wider class
of problems.


2  Regularization Networks and Minimization Problems

Recently, Poggio and Girosi (1989, 1990) described how standard mathematical
techniques can be used to approach the problem of learning. Learning an
input-output mapping from a set of examples can be regarded as synthesizing
an approximation of a multidimensional function, that is, solving the
problem of surface reconstruction from sparse data. Poggio and Girosi used a
variational approach based on regularization theory (Tikhonov and Arsenin,
1977; Morozov, 1984; Bertero, 1986) and showed that this problem is
equivalent to finding the optimal “weights” of a particular type of network
with one layer of hidden units, called a regularization network. The problem
of learning therefore becomes equivalent to the following minimization
problem:

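Minimization problem for regularization networks: Let $H[f]$ be the
functional

$$H[f] = \sum_{i=1}^{N} (y_i - f(\mathbf{x}_i))^2 + \lambda \|P_y f\|^2 , \qquad (1)$$

where $\{(\mathbf{x}_i, y_i) \in R^d \times R\}_{i=1}^{N}$ is a given sparse
data set, $P_y$ is a linear operator that is radially symmetric in the
variable $\mathbf{y} = W\mathbf{x}$, $W$ is a $d \times d$ matrix,
$\|\cdot\|$ is a norm on the function space to which $P_y f$ belongs, and
$\lambda$ is a positive real number. Having defined the function

$$f^*(\mathbf{x}) = \sum_{\alpha=1}^{n} c_\alpha G(\|\mathbf{x} - \mathbf{t}_\alpha\|_W) , \qquad (2)$$

find $\{c_\alpha\}_{\alpha=1}^{n}$, $\{\mathbf{t}_\alpha\}_{\alpha=1}^{n}$ and
$W$ such that $H[f^*]$ is minimum.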
Here $G$ is the Green's function of $\hat{P}P$, $\hat{P}$ indicating the
adjoint of the operator $P$, and the function $f^*$ that minimizes the
functional $H$ is the solution of the approximation problem. Equations (1)
and (2) define the class of cost functions to which our algorithm has been
applied.
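As an illustration only, the cost function defined by equations (1) and (2)
might be evaluated as in the sketch below. Several choices here are
assumptions that the text does not fix: $G$ is taken to be a Gaussian, the
weighted norm $\|\mathbf{z}\|_W$ is taken to be the Euclidean norm of
$W\mathbf{z}$, the regularization term is dropped (only the data term of
eq. (1) is computed), and the names unpack, f_star and data_term, as well as
the flat parameter layout, are hypothetical.

```python
import numpy as np

def unpack(params, n, d):
    """Split a flat parameter vector into (c, t, W); the layout is an assumption."""
    c = params[:n]                           # n coefficients c_alpha
    t = params[n:n + n * d].reshape(n, d)    # n centers t_alpha in R^d
    W = params[n + n * d:].reshape(d, d)     # d x d matrix W
    return c, t, W

def f_star(X, params, n, d):
    """Evaluate f*(x) of eq. (2) at every row of X (shape N x d)."""
    c, t, W = unpack(params, n, d)
    diff = X[:, None, :] - t[None, :, :]     # x - t_alpha, shape (N, n, d)
    y = diff @ W.T                           # W (x - t_alpha), shape (N, n, d)
    r2 = np.sum(y ** 2, axis=-1)             # squared weighted distance
    return np.exp(-r2) @ c                   # Gaussian G (assumed, not from the text)

def data_term(params, X, y, n, d):
    """Sum of squared residuals, i.e. eq. (1) without the regularization term."""
    return float(np.sum((y - f_star(X, params, n, d)) ** 2))
```

With this layout the number of free parameters is $n + nd + d^2$, which for
$d = 2$ and $n = 5, 10, 15$ gives 19, 34 and 49, the numbers of variables
quoted in the tables of Section 4.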


3  The  Algorithm 

This section gives a rather detailed presentation of the minimization
algorithm that we propose. We first describe the elemental step of the
procedure, which can be considered its core, and then outline how loops of
elemental steps are arranged to form the outer structure of the algorithm.

In what follows, the outcome of the random extraction of a real number
within the interval $[a, b]$ is denoted by random$(a, b)$, and the set of
natural numbers greater than 0 and smaller than $n + 1$ is denoted by $I_n$.

Let $g = g(\mathbf{x})$ be a multivariate real function defined in some
domain, $A$, of $R^n$. Let $\mathbf{x}$ be an internal point of $A$, and let
$V = \{P_1, \ldots, P_k\}$ be a partition of $I_n$. Let us set, for each
element $P_i \in V$, a positive number $\omega_i$, and let $\Omega$ be the
set of all such numbers. The set $\Omega$, for reasons that will be clear in
a moment, is called the noise list.

An elemental step consists of a series of $k$ perturbations of the current
point, each obtained by adding random noise only to a certain subset of its
components. In particular, the $j$-th perturbation of any series concerns
only the components whose indices belong to $P_j$. In this sense, partition
$V$ realizes a grouping of variables that, for some reason, may be
considered “similar”. A perturbation is accepted, and the current point
consequently updated, if and only if the value taken by the function $g$ at
the perturbed point is smaller than the value of $g$ at the current point.
The amplitude of the noise for the group of components that the perturbation
has modified is doubled when the perturbation is accepted and halved
otherwise. More formally:
- $g_c$ is set equal to $g(\mathbf{x})$;

- for each element $P_i$ of partition $V$,

    - an $n$-dimensional vector, $\mathbf{v}$, is generated according to
      the following rule:

      $$v_j = \begin{cases} \mathrm{random}(-\omega_i, \omega_i) & \text{if } j \in P_i \\ 0 & \text{otherwise} \end{cases}$$

      where $v_j$ indicates the $j$-th component of $\mathbf{v}$;

    - $g_a$ is set equal to $g(\mathbf{x} + \mathbf{v})$;

    - if $g_a < g_c$, then

        - $g_c$ is set equal to $g_a$;

        - $\mathbf{x}$ is set equal to $\mathbf{x} + \mathbf{v}$;

        - $\omega_i$ is set equal to $2\omega_i$;

      else

        - $\omega_i$ is set equal to $\omega_i/2$.


Performing  a  single  step  of  the  algorithm  may  result  either  in  the  update 
of  the  noise  list  or  in  the  update  of  the  noise  list  and  the  current  point.  The 
whole  algorithm  consists,  in  essence,  in  performing  a  series  of  such  steps, 
keeping  track,  meanwhile,  of  the  evolution  of  the  current  point  as  well  as  the 
evolution  of  the  noise  list. 
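
The elemental step described above can be sketched in a few lines of
Python. This is a minimal illustration rather than the original
implementation: the name elemental_step, the use of 0-based indices for the
groups, and the representation of the point and of the noise list as plain
lists are all assumptions.

```python
import random

def elemental_step(g, x, partition, noise, rng=random):
    """Perform one elemental step, updating x and noise in place.

    g         -- function to minimize; takes a list of floats
    x         -- current point (list of floats)
    partition -- list of index groups P_1..P_k (0-based indices here)
    noise     -- noise list: one positive amplitude per group
    Returns the value of g at the (possibly updated) current point.
    """
    g_c = g(x)
    for i, group in enumerate(partition):
        # Perturb only the components in group i, uniformly in [-w_i, w_i].
        trial = list(x)
        for j in group:
            trial[j] += rng.uniform(-noise[i], noise[i])
        g_a = g(trial)
        if g_a < g_c:
            # Accepted: move to the perturbed point and double the amplitude.
            g_c = g_a
            x[:] = trial
            noise[i] *= 2.0
        else:
            # Rejected: keep the current point and halve the amplitude.
            noise[i] /= 2.0
    return g_c
```

Each call evaluates $g$ once for the current point and once per group, so
its cost is $k + 1$ function evaluations and no gradient evaluations.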

Since the elements of the noise list dictate the maximum amplitude of the
perturbations that will occur in the following step, the noise list
undergoes fairly intricate dynamics during the minimization process.
However, as the current point approaches a minimum of the function, we
expect the elements of the noise list to become smaller and smaller,
although their rates of convergence to zero may differ considerably from
each other.

Given these considerations, it seems sensible to stop the minimization
process once the elements of the noise list have reached values that can be
considered small and no further relevant evolution is expected to take
place.

How to evaluate whether the values contained in the noise list are small
enough is a matter of choice, and we propose the following criterion: the
elements of the noise list are small enough if

$$\omega_i < \eta , \quad \forall\, i \in I_k ,$$

where $\eta$ is a given threshold.
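
A possible way to wire this stopping criterion into the outer loop is
sketched below. The function names and the cap max_steps are assumptions
added for illustration (the cap simply guards against runs in which the
noise list does not shrink); step stands for any callable that performs one
elemental step on the point and the noise list in place.

```python
def noise_is_small(noise, eta):
    """Stopping criterion: every amplitude in the noise list is below eta."""
    return all(w < eta for w in noise)

def run_until_small(step, x, noise, eta, max_steps=2000):
    """Repeat step(x, noise) until the noise list is small or max_steps is hit.

    step is expected to update x and noise in place.
    Returns the number of elemental steps actually performed.
    """
    steps = 0
    while not noise_is_small(noise, eta) and steps < max_steps:
        step(x, noise)
        steps += 1
    return steps
```

Near a minimum almost every perturbation is rejected, so every amplitude is
halved at each step and the criterion is met after a number of steps that
grows only logarithmically with the ratio between the largest amplitude and
$\eta$.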

Once the process has stopped and the final current point is returned, there
is no way, in general, to ascertain how close this point is to the actual
global minimum of the function. Hence, the only possibility left to us is to
restart the procedure using a different initial point. However, if we
believe that the point we are looking for is not very far from the point
that we have reached, we may want to restart the procedure near the result
we have just reached, both in terms of initial point and of noise list. A
practical way to do this is to generate the new noise list by multiplying
the elements of the old one by a given constant. The new starting point can
then be generated by perturbing the old one according to the new noise list.

Results obtained by restarting and performing the minimization procedure
several times are then collected, and the point where the function has
reached its best value is considered the outcome of the whole algorithm.
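
The restart strategy just described might look as follows. This is a sketch
under stated assumptions: minimize is a hypothetical callable that runs one
complete minimization and returns the final point, the final noise list and
the final function value; the default values of scale and n_runs are
arbitrary, since the text does not specify the constant or the number of
restarts.

```python
import random

def restart_search(minimize, x0, noise0, partition,
                   scale=4.0, n_runs=5, rng=random):
    """Run minimize several times, restarting near the previous result.

    minimize(x, noise) -> (x_final, noise_final, g_final)  (hypothetical API)
    Returns the best point found and its function value.
    """
    best_x, best_g = None, float("inf")
    x, noise = list(x0), list(noise0)
    for run in range(n_runs):
        if run > 0:
            # New noise list: the old one multiplied by a given constant.
            noise = [scale * w for w in noise]
            # New starting point: the old one perturbed by the new noise list.
            x = list(x)
            for i, group in enumerate(partition):
                for j in group:
                    x[j] += rng.uniform(-noise[i], noise[i])
        x, noise, g_val = minimize(x, noise)
        if g_val < best_g:
            best_x, best_g = list(x), g_val
    return best_x, best_g
```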

4  Experimental  Results 

Our algorithm has been tested on the problem of finding the function,
belonging to the class expressed by eq. (2), which minimizes the functional
of eq. (1). As Poggio and Girosi showed (1989, 1990), the solution to this
minimization problem best approximates the data $y_i$, $i = 1, \ldots, N$.
Each experiment consisted of the following steps:

- a data set containing $N$ elements was generated by sampling at random
a given multivariate function $h = h(\mathbf{x})$;

- the number $n$ of eq. (2) was set, and the number of parameters that
the function $f^*$ depends on was consequently fixed;

- a minimum of the function $H[f^*]$, considered as a function of
$\{c_\alpha\}_{\alpha=1}^{n}$, $\{\mathbf{t}_\alpha\}_{\alpha=1}^{n}$ and
$W$, was sought;

- the mean square difference between the function $h$ and its
approximation was computed over a randomly generated test set.
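
The first and last steps of this protocol can be illustrated with the short
sketch below. The sampling domain (a hypercube with uniform sampling), the
function names, and the convention that h and the approximation are
callables acting on an array of points are assumptions made for the example,
not details given in the text.

```python
import numpy as np

def sample_data(h, N, d, rng, low=0.0, high=1.0):
    """Draw N random points in [low, high]^d (assumed domain) and evaluate h."""
    X = rng.uniform(low, high, size=(N, d))
    return X, h(X)

def mean_square_error(h, approx, X_test):
    """Mean square difference between h and its approximation on a test set."""
    return float(np.mean((h(X_test) - approx(X_test)) ** 2))
```

A training set would be produced with sample_data, the parameters fitted by
the minimization algorithm, and the resulting approximation scored with
mean_square_error on a second, independently drawn set of points.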
In order to evaluate the performance of our algorithm, we compared its
results with those obtained by the fairly classical method of gradient
descent with adaptive step. One hundred iterations of gradient descent and
two thousand iterations of our method both guaranteed a good degree of
convergence and required the same computing time.

Table (1) collects results showing the average behavior of the algorithms,
that is, the average of the mean square error over several experiments.

Table (2) reports the best performance of the algorithms, that is, the
lowest mean square approximation error obtained in each group of
experiments: in this case the nondeterministic algorithm performs better
than gradient descent, since it has more chances to attain the global
minimum.


Number of variables     19      34      49
Gradient Descent        0.11    0.11    0.09
Adaptive Noise          0.18    0.13    0.10

Table 1: Errors affecting surface reconstruction obtained by using the
method that we propose and the gradient descent with adaptive step method
are compared. Each column shows the average of the mean square
approximation errors from a set of 10 experiments. The function to be
reconstructed was $h(x, y) = e^{-(0.5x + 1.2y)} \cos 5x \sin 3y$ and the
mean square approximation error was computed over a set of 10,000 points.
Columns 1, 2, 3 refer to groups of experiments performed by setting the
number $n$ of eq. (2) to 5, 10, and 15 respectively. The functions to be
minimized therefore depended on 19, 34, and 49 variables. For each
experiment, 100 iterations of the gradient descent with adaptive step
method and 2000 iterations of our method were performed.


Number of variables     19      34      49
Gradient Descent        0.045   0.057   0.044
Adaptive Noise          0.015   0.027   0.038

Table  2  :  As  Table  (1),  except  that  it  shows  in  each  column  the  lowest 
mean  square  approximation  error  over  a  group  of  10  experiments. 
The experience we have gathered from hundreds of runs of both algorithms
leads us to believe that the behavior exhibited in the experiments reported
here is typical.


5  Remarks 

The first aspect of our algorithm to be noticed is that it does not require
any computation of the gradient. This is crucial, of course, whenever an
explicit expression of the gradient is either unavailable or computationally
expensive. For example, for the functions we experimented with, computing
the gradient was 10 or 20 times more time consuming than computing the value
of the function.

Another aspect of the method is that, unlike many minimization methods, it
does not rest on any particular assumptions about the function to be
minimized and the structure of its minima. By giving credit to those
perturbations that yield a decrease in the value of the function, we obtain
an estimate of the gradient that, though very rough, may capture the
essential trend of the function. In the situations of interest to us, this
line of action may be recommended, since we want to avoid spending time
computing a quantity, the gradient, that may end up telling little about the
location of the minima of the function.

Let  us  finally  remark  that  the  introduction  of  the  partition  list  allows  us 
to  group  the  variables  according  to  their  typical  magnitudes,  which,  due  to 
the  different  role  variables  play,  may  differ  considerably. 
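
For the regularization networks of Section 2, one natural grouping, offered
here only as a hypothetical example, puts the coefficients, the centers and
the entries of the matrix $W$ in three separate groups, each with its own
noise amplitude. The flat parameter layout and the function name below are
assumptions.

```python
def partition_by_role(n, d):
    """Group 0-based indices of a flat vector laid out as [c | t | W].

    c: n coefficients, t: n centers of dimension d, W: d x d matrix.
    Returns three index groups, one per role.
    """
    coeffs = list(range(0, n))
    centers = list(range(n, n + n * d))
    matrix = list(range(n + n * d, n + n * d + d * d))
    return [coeffs, centers, matrix]
```

With $n = 5$ and $d = 2$ this yields groups of 5, 10 and 4 indices, covering
all 19 variables exactly once.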